NASA  Contractor  Report  194906 
ICASE  Report  No.  94-27 


AD-A280  468 



ICASE 


ANALYSIS  OF  OPTIMISTIC 
WINDOW-BASED  SYNCHRONIZATION 




Phillip  M.  Dickens 
David  M.  Nicol 
Paul  F.  Reynolds,  Jr. 
J.M.  Duva 




94-18964 


Contract NAS1-19480
April  1994 


Institute  for  Computer  Applications  in  Science  and  Engineering 
NASA  Langley  Research  Center 
Hampton, VA 23681-0001


Operated  by  Universities  Space  Research  Association 




Analysis  of  Optimistic  Window-based 
Synchronization  * 

Phillip  M.  Dickens  David  M.  Nicol 

ICASE  College  of  William  and  Mary 

Paul  F.  Reynolds,  Jr.  J.  M.  Duva 

University  of  Virginia  University  of  Virginia 


Abstract 

This paper studies an analytic model of parallel discrete-event simulation, comparing
the costs and benefits of extending optimistic processing to the YAWNS synchronization
protocol. The basic model makes standard assumptions about workload and routing; we
develop methods for computing performance as a function of the degree of optimism
allowed, overhead costs of state-saving, rollback, and barrier synchronization, and LP
aggregation. This allows an approximation-based analysis of the range of situations
under which optimism is a beneficial extension to YAWNS. We find that limited optimism
is beneficial if the processor load is sparse, but that aggregating LPs onto processors
improves the relative performance of YAWNS.

1  Introduction


Discrete-event  simulations  model  physical  systems.  The  literature  on  parallel  discrete-event 
simulation  (PDES)  usually  views  a  physical  system  as  a  set  of  communicating  physical 
processes,  each  of  which  is  represented  in  the  simulation  by  a  logical  process  (LP).  LPs 
communicate  through  time-stamped  messages  reflecting  changes  to  the  system  state.  A 
time-stamp  reflects  an  instant  where  a  state  change  occurs  in  the  physical  process  model. 

Parallel discrete-event simulation poses difficult synchronization problems, due to the
underlying sense of logical time. Each LP maintains its own logical clock representing the
time up to which the corresponding physical process has been simulated. The fundamental
problem is to determine when an LP may execute a known future event, and in so doing
advance its logical clock. If an LP advances its logical clock too far ahead of any other
LP in the system it may receive a message with a time-stamp in its logical past, called a
straggler. The threat of stragglers is dealt with by saving the simulation state periodically,
and rolling back as appropriate when a straggler arrives. Messages sent at times ahead of
the straggler’s time-stamp must be undone. Fundamental problems of PDES are reviewed
in Misra (1986), Fujimoto (1990), and Righter and Walrand (1989). Nicol and Fujimoto (1994)
give a more current state-of-the-art review.

Most PDES synchronization protocols fall into two basic categories. Conservative
protocols (e.g., Chandy and Misra 1979, Bryant 1979, Peacock, Wong and Manning 1979,
Lubachevsky 1988, Chandy and Sherman 1989, and Nicol 1993) do not allow an LP to
process an event with time-stamp t if one is unable to assert that it will not receive another
event with time-stamp less than t at some point in the future. Optimistic protocols (e.g.,
Time Warp, Jefferson (1985)) allow an LP to process an event before it is known for certain
that the LP will not later need to process an event with an earlier time-stamp. Causality errors
are corrected through a rollback mechanism. A more careful taxonomy of protocol
characteristics is detailed in Reynolds (1988); in keeping with standard (but imprecise) practice,
we will speak in terms of conservative and optimistic protocols.

The earliest synchronization protocols are asynchronous: an LP synchronizes solely on
the basis of interactions with LPs with which it directly communicates. Recently more
synchronous protocols have attracted interest. While details vary, the basic idea is to
incorporate barrier synchronizations and global reductions on functions of future simulation
times. Examples include Moving Time Window (Sokol et al., 1988), Conservative Time
Windows (Ayani and Rajaei, 1992), Conditional Events (Chandy and Sherman, 1989), Bounded
Lag (Lubachevsky, 1988), Synchronous Relaxation (Eick et al., 1993), Bounded Time Warp
(Turner and Xu, 1992), Breathing Time Buckets (Steinman, 1991), and YAWNS (Nicol,
1993). The advantage to a conservative protocol is that synchronization information moves
quickly through the system, lowering overhead costs. This efficiency usually comes at the
price of more pessimistic synchronization, e.g., an LP A may block against the threat of
receiving a message at time t, whereas the threatened message is actually from LP B to LP
C. The global mechanisms allow for efficient computation of simulation times, like t, but do
not handle routing details well. The advantage to an optimistic protocol is the elimination of
a separate GVT (Global Virtual Time) calculation, and the reduction of the risk of cascading
rollbacks. As for the conservative methods, the price paid is the reduction of asynchrony,
and more limited opportunities for parallelism.

Figure 1: Extended YAWNS window is comprised of one conservative and one optimistic
subwindow.

Our interest is in the conservative YAWNS protocol, and in determining conditions
under which it makes sense to extend it by incorporating optimism. YAWNS conservatively
constructs a window of simulation time within which events on different processors may be
concurrently simulated (details follow in Section 2). This conservative window tends to be
small. However, under the YAWNS mechanism, an LP that executes an event outside of
the window risks receiving a straggler message. We extend optimism to YAWNS by
appending an optimistic window to the conservative window; when an LP executes events in
the optimistic window it must be prepared to deal with straggler messages. The advantage
is to increase the number of events processed per window, in hopes of lowering the
amortized cost of computing and synchronizing at the window. The cost of the extension is
due to state-saving, rollback, and additional global synchronization. The basic form of the
extension is illustrated in Figure 1: at simulation time t all LPs use the standard YAWNS
mechanism to compute the conservative window [t, t + C), but also append an optimistic
window [t + C, t + A). Processors synchronize globally at simulation time t + A. For our
purposes we take A as a user-specified parameter that governs the degree to which optimistic
execution is permitted. We will call the method YOW (YAWNS Optimistic Window).

We find that there is a best optimistic window size that is much larger than YAWNS’s.
We derive formulas for the performance of YAWNS and YOW as a function of synchronization,
state-saving, and event-reprocessing costs. Using these we determine that when the problem
is sparse (one fine-grained LP per processor), YOW prevails asymptotically (as the number
of LPs increases). However, if we fix the size of the architecture and aggregate
LPs onto processors, then YAWNS can prevail.

The remainder of the paper is organized as follows. Section 2 describes our analytic model
and its relationship to others in the literature. Section 3 develops methods for approximating
the probability distribution of an LP’s workload, including reprocessed messages due to
rollbacks. Section 4 applies those approximations to compare conservative YAWNS with its
optimistic extension, and Section 5 presents our conclusions.

2  Model 

Our analysis is of a parallelized queueing network simulation. LPs represent servers, and
events occur when jobs either enter service, or are received by a queue. A job’s random
service time increases its LP’s simulation clock by the service amount. Event processing
cannot be interrupted; furthermore a job’s post-service destination is presumed to be known
at the time it enters service. The destination is chosen uniformly at random from the set of
all LPs. We do assume that the data content and next destination of a serviced job depend
upon the contents and times of all jobs received by the LP prior to the event entering
service. Because of this, a message reporting the job’s arrival at its new destination is sent
to its recipient at the time the job enters service. This is called pre-sending the job, and
is an important aspect of both YAWNS and Time Warp. A message has both a send-time
and a receive-time, corresponding to the service-entry and service-departure times. Service
time (reflecting an advancement in simulation time) is also random, and is exponentially
distributed with rate $\mu_s$. We assume that the cost of processing a service-entry event or a
job arrival event is unity; our expression of physical execution times will be in these units.

While simple, models like these are the basis for several analytic studies. This model is
similar to the one studied by Gupta, Akyildiz and Fujimoto (Gupta et al., 1991) (which we’ll
refer to as GAF) in their study of asynchronous Time Warp. The main differences are that
we assume unit cost for executing an event and the GAF model assumes an exponentially
distributed execution cost; that our model is basically that of a queueing system with single
servers and a non-preemptive queueing discipline whereas the GAF model is of a queueing
system with infinite servers; and that our model indirectly reflects the effects of communication
delay, whereas the GAF model assumes instantaneous communication. These differences are
significant enough to prevent us from quantitatively comparing our model results to GAF’s.
We do note that GAF’s assumption of random event costs should tend to worsen performance
over our model, but the instantaneous communication and infinite servers should tend to
improve it over ours. Furthermore, one increases the available parallelism in the GAF model by
increasing the number of messages; in our model one must increase the number of LPs.

Our model is also loosely related to the self-initiating model studied by Nicol (1991), and
is subsumed by Nicol’s message-initiating model in his study of YAWNS (Nicol, 1993). The
former model concentrates on the effects of fan-outs greater than one, and ignores the effects
of rollback; the latter model provides the analysis of YAWNS that we use in this paper. The
bonding model of Eick et al. (1993) is closely related to ours, in that it essentially describes
the behavior of a parallelized queueing simulation identical to ours except that a message
describing a job’s departure is sent only at the simulation instant when the job departs,
whereas we assume the message is sent at the time the job enters service. Another difference is
that we assume that a re-executed event may send its subsequent message to a different
LP than before, whereas the bonding model assumes it is directed to the same LP. Finally,
the randomly uniform routing assumption is shared by the model studied by Felderman and
Kleinrock (1991), who make different assumptions concerning time-stamp advancement and
event execution time.

Our analysis is unique in several ways. First, nearly all of the aforementioned models
regard communication, state-saving, and synchronization as negligible. We believe that
these same costs largely define which synchronization approach is best suited for a problem,
and so should be explicitly incorporated in the model. Secondly, our analysis is of an
optimistic window-based scheme where performance depends on the level of optimism; in
this regard only Eick et al.’s model is similar. Our analytic approach is different, but can
also be extended to the Eick et al. model. While we apply the model to the problem
of extending YAWNS, the approach applies more generally to the analysis of window-based
optimistic protocols. Finally, only the analysis in Nicol (1993) considers the beneficial effects
of aggregating LPs; as we shall see, this consideration can make it more advantageous to
forgo optimism in a sufficiently aggregated case, whereas optimism is better in the
non-aggregated case.

Our  analytic  approach  is  computational  and  is  based  on  simplifying  approximations. 
We  develop  an  intuitive  approximation  to  the  probability  distribution  of  the  number  of 
events  processed  by  an  LP  while  executing  a  window.  The  workload  distribution  includes 
reprocessed  events  induced  by  rollbacks.  With  this  distribution  as  a  basis  we  add  overhead 
costs,  and  compute  the  average  execution  cost  (in  real  time)  per  unit  simulation  time. 

Before proceeding to the analysis, it is useful to review the YAWNS mechanism. Presume
that all LPs have executed all events up to simulation time t. Under the assumptions that
permit pre-sending messages, each LP i can examine its state and predict the departure time
$d_i(t)$ of the next job to receive service, excluding the one receiving service at t, assuming
no further message arrivals at i prior to that job entering service. This sort of lookahead is
called conditional knowledge by Chandy and Sherman (1989), because the validity of $d_i(t)$ is
conditional. Using standard minimum reduction techniques, the LPs can quickly compute
$w(t) = \min_i \{d_i(t)\}$; the conservative YAWNS window is [t, w(t)). By construction, no job
entering service during the window [t, w(t)) also departs service in it. Coupling this feature
with message pre-sending, no message generated by an event in [t, w(t)) has a receive time in
[t, w(t)).
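
The window computation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `yawns_window` and `events_in_window` are hypothetical, the per-LP predictions $d_i(t)$ are passed in as a plain list, and a real simulator would obtain the minimum through a parallel reduction rather than a local `min`.

```python
def yawns_window(t, next_departure_times):
    """Return the conservative window [t, w(t)) as a (lower, upper) pair.

    next_departure_times holds one predicted departure time d_i(t) per LP
    (hypothetical input; in a parallel simulator this is a min-reduction).
    """
    w = min(next_departure_times)
    if w < t:
        raise ValueError("a predicted departure time lies before t")
    return (t, w)


def events_in_window(event_times, window):
    """Select the events whose time-stamps fall in the half-open window."""
    lower, upper = window
    return [e for e in event_times if lower <= e < upper]
```

For example, with predictions 12.5, 11.2 and 14.0 at t = 10.0 the window is [10.0, 11.2), and only events with time-stamps below 11.2 may be executed without risk of a straggler.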

We extend optimism to YAWNS by requiring that LPs synchronize at the upper edge of
the optimistic window. The same min-reduction as before is used to synchronize, only now
the result w(t) indicates the upper edge of a “safe region” into which no straggler message will
ever venture. Only one global synchronization is needed per window: a min-reduction serves
both to synchronize the processors at some time t, and to compute w(t). The processors
know to synchronize again at time t + A. No state-saving need be performed prior to any
event in the conservative window.

One essential difference between YOW and Bounded Time Warp (Turner and Xu 1992) is the
computation and exploitation of the conservative window. Another is the proposed
synchronization mechanism. Special care must be taken when synchronizing at t + A since an
LP may simulate up to that time but then be rolled back. Bounded Time Warp proposes
an essentially linear time (in the number of LPs) termination detection mechanism. More
efficient methods can be supported in hardware (Reynolds et al. 1993), or using special
algorithms in software (Nicol 1993a). Our model presumes Nicol’s software solution.

We assume that an event at LP i which is reprocessed due to rollback randomly selects
a new destination with each reprocessing. This reflects the assumption that a message’s
content and destination are a sensitive function of the complete message history at LP i up to
the time where the job enters service. Thus two messages are generated upon reprocessing
an event: an anti-message to cancel its previous routing and a new routing message sent
to another (probably different) LP. Like other analyses of Time Warp, we assume that the
anti-message exacts no computational cost at either the sender or receiver. However, our
model will not assume that an anti-message is received instantaneously. We do not model
the message delay directly, but rather model the effects of such a delay.


3  Analysis 

Within any given window of width A, an LP will execute (and re-execute) a random number
of events, say W. This random variable (like all others in our analysis) depends on A, but
this dependence will not be expressed in the notation. Our initial goal is to determine the
probability distribution of W; note that this distribution is the same for all LPs under the
uniformity assumptions made. Given the distribution, we can add overhead and execution
costs, and determine the mean time $\mu_A$ required to complete the window by the processor
requiring the longest time to do so. The ratio $\mu_A / A$ serves as our metric, measuring the
average execution time required per unit simulation time.


We focus on “generations” of messages, a notion which arises as follows. Imagine that
LPs synchronize at t, and then each executes all known events in the window without paying
any attention to any possible communications. The set of messages sent during the first
sweep with time-stamps in [t, t + A) are in generation 1. Imagine now that an LP gathers
up all the generation 1 messages sent to it, and processes them. These must cause rollback,
anti-messages, and new routing messages. The set of all anti-messages and new routing
messages are in generation 2. In general, a message is in generation i + 1 if it is the direct
result of a rollback caused by a generation i message. We denote the random number of
generation i messages received by an LP as $G_i$, and denote by $R_i$ the random number of
events processed as a result of receiving generation i messages.

We assume that A is small enough and the message density is high enough to ignore the
possibility of a job received in [t, t + A) going into service in that same window. Under this
assumption, the number and times of all service-entry events during [t, t + A) are known
after the LPs synchronize at t, and remain unaltered (except for destination and content)
while processing [t, t + A). The number of such events at a given LP is a random variable S
that is Poisson distributed with mean $A\mu_s$. All other events are job arrivals, of which there
are J, which is also Poisson with mean $A\mu_s$.

Event reprocessing costs depend on how quickly the parallel simulator receives and reacts
to straggler messages. For example, the analysis of Gupta et al. (1991) assumes zero message
transmission delay, and that rollback occurs immediately following the complete processing
of whatever event is being served at the instant the straggler message arrives. If two or
more stragglers arrive during that processing time, the reprocessing effect is as though only
the straggler with least time-stamp was received; the others exact no additional cost. But now
consider the effect a communication delay may have on the algorithm. If A is small enough
an LP will have few events in a window: in the time it takes a message to travel between
processors, the recipient LP will already be ready to synchronize at time t + A. Even if
communication is faster it is frequently the case (and we have observed empirically on actual
applications) that the cost of probing for new messages after each event is prohibitively high
on distributed memory architectures, because such a probe involves a system call. To model
this effect, we assume that if a straggler message is received at some time $s \in [t, t + A)$ then
the effect of that straggler is to re-execute all events at that LP from s to t + A, and to send
anti-messages after all messages generated previously by those events. If an LP receives
K generation i stragglers, we assume that each is processed serially, incurring K separate
recomputation costs. This assumption is reasonable so long as an LP has only a few events
in a window.
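
A minimal sketch of this reprocessing assumption, in Python, may help fix ideas. The function name `reprocessed_events` and the list-based inputs are hypothetical; the code only counts re-executions under the stated rule that each straggler independently forces re-execution of every known event from its time-stamp to the end of the window.

```python
def reprocessed_events(event_times, straggler_times, window_end):
    """Count events re-executed when stragglers are handled one at a time.

    A straggler with time-stamp s forces re-execution of every known event
    with time-stamp in [s, window_end); K stragglers incur K separate costs.
    """
    total = 0
    for s in straggler_times:
        total += sum(1 for e in event_times if s <= e < window_end)
    return total
```

With events at 0.1, 0.4 and 0.7 in a window ending at 1.0, stragglers at 0.3 and 0.5 cost 2 + 1 = 3 re-executions, even though the second rollback repeats work already redone by the first.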

If we define generation 0 messages as corresponding to the service-entry events and job
arrival events, we write $R_0 = S + J$, and express the total number of events processed in the
window by

$$W = \sum_{i=0}^{\infty} R_i.$$

The distribution of each reprocessing cost $R_i$ depends on the number and time-stamps
of generation i messages. The actual distributions for these messages are intractably
complicated, so we will approximate. For generations $i \geq 1$, we assume that $G_i$ has a Poisson
distribution. Such an assumption is standard, since given a total number N of generation i − 1
messages, the number arriving at an LP is binomial B(N, 1/P), P being the number of LPs.
Binomials with large N and small probabilities of success are frequently modeled as Poisson.
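
The quality of the Poisson approximation to B(N, 1/P) is easy to inspect numerically. The sketch below is illustrative only: the helper names and the sample values N = 1024 and P = 256 are assumptions, not figures from the analysis.

```python
import math


def binomial_pmf(n, p, k):
    """Pr{B(n, p) = k}."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(lam, k):
    """Pr{Poisson(lam) = k}."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_pmf_gap(n, p, k_max):
    """Largest pointwise difference between B(n, p) and Poisson(n * p)."""
    return max(
        abs(binomial_pmf(n, p, k) - poisson_pmf(n * p, k))
        for k in range(k_max + 1)
    )
```

Calling `max_pmf_gap(1024, 1.0 / 256.0, 30)` compares B(1024, 1/256) against a Poisson with mean 4; the gap shrinks as P grows with N/P held fixed, which is the regime the model assumes.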

We also approximate the distributional form of the random arrival time of a generation
i message. Each such message corresponds to a service-entry event in some LP; the arrival
time is the service-entry time plus an exponential. Each service-entry event has some rank
reflecting whether it is the first, second, or so on service-entry event in [t, t + A) on its LP.
The arrival time distribution of the message sent by the i-th service-entry event following time
t is t plus the convolution of i + 1 i.i.d. exponentials, i.e., an Erlang-(i + 1); we say that the
arrival message has rank i + 1. We will have occasion to condition on the service-entry event
lying in [t, t + A), in which case the message’s arrival time distribution is suitably modified.
In order for such a message m to be sent (in generations > 0), the i-th service-entry event
must be reprocessed, implying the arrival of an earlier straggler; this is information that alters
m’s arrival time distribution. Our model does not attempt to capture this distributional
dependency. Under our simplifying assumption then, every generation i arrival message in
[t, t + A) has a time-stamp whose distribution is t plus some Erlang conditioned on being
less than A. Let $\alpha_{i,k}$ denote the mean fraction of generation i messages that have rank k.
Letting $f_k(s)$ be the density function for an Erlang-k conditioned on being less than A, we
approximate the arrival time of an arbitrary generation i message as t plus a random offset
whose density function is the mixture $\sum_{k=2}^{\infty} \alpha_{i,k} f_k(s)$.

Table  1  summarizes  our  notation.  All  random  quantities  are  LP-oriented,  rather  than 
system-oriented. 

It remains to determine the weighting factors $\{\alpha_{i,k}\}$ and the distributions for W, $G_i$,
and $R_i$. The approach is to condition on S + J = k, and determine distributions for $G_i$, $R_i$,
and W suitably conditioned; call them $G_i(k)$, $R_i(k)$, and W(k). Assuming that all
arrival messages are independent of each other, we compute $W(k) = \sum_{i=0}^{\infty} R_i(k)$, because
the individual random variables in the convolution will be independent. It is straightforward
then to uncondition on S + J (since S and J are independent and Poisson). The values for
$E[G_i]$ and $\{\alpha_{i,k}\}$ can be built up with increasing i, as will be shown.

    Symbol            Meaning
    W                 Total events processed in a window
    S                 Service-entry events in a window
    J                 Job arrival events
    $\mu_s$           Service rate for queue server
    $G_i$             Generation i arrival messages received
    $R_i$             Events reprocessed by all generation i arrivals
    $r_i$             Events reprocessed by a single generation i arrival
    $\alpha_{i,j}$    Fraction of generation i arrivals with rank j
    $f_j(s)$          Density function of Erlang-j conditioned on s < A
    $F_j(s)$          Cumulative distribution function for Erlang-j
    B(n, p)           A binomial random variable with parameters n and p
    $H_j(s)$          Cumulative distribution function of Erlang-j conditioned
                      on the sum of its first (j − 1) stages being less than A

Table 1: Summary of Notation

Let us first consider $E[G_1]$. A generation 1 message arises whenever a service-entry event
in [t, t + A) sends an arrival message with time-stamp less than t + A (all other arrival events
were sent by service-entry events in previous windows). If we condition on S = k
service-entry events in [t, t + A), the joint distribution of their times in [t, t + A) is identical to that
of k independent [t, t + A] uniform random variables (Ross 1983, pg. 37). Choosing one of
these k uniformly at random, the probability that its arrival message lies outside of [t, t + A)
is given by

$$\Pr\{\text{Arrival message time for service-entry event} > t + A \mid S = k\}
  = \int_t^{t+A} \frac{1}{A}\, e^{-\mu_s (t + A - v)}\, dv
  = \frac{1.0 - e^{-\mu_s A}}{\mu_s A}.$$

This leads to the observation that the mean number of arrival messages generated in [t, t + A)
that fall outside of [t, t + A) is $1.0 - e^{-\mu_s A}$. Since the mean total number of arrival messages
generated in [t, t + A) is $\mu_s A$, we obtain

$$E[G_1] = \mu_s A - (1.0 - e^{-\mu_s A}).$$
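
As a numerical sanity check on this step, the following Python sketch evaluates the escape probability by a midpoint-rule integral and computes the mean number of generation 1 messages, reading the closed form as $E[G_1] = \mu_s A - (1 - e^{-\mu_s A})$. The function names and the sample values $\mu_s = 1$, A = 0.5 are assumptions chosen for illustration.

```python
import math


def prob_message_outside(mu_s, A, steps=10000):
    """Midpoint-rule estimate of (1/A) * integral over the window of
    exp(-mu_s * (remaining window length)), i.e. the chance that the
    arrival message of a uniformly placed service-entry event leaves
    the window."""
    h = A / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h            # offset of the service-entry event from t
        total += math.exp(-mu_s * (A - u))
    return total * h / A


def expected_generation1(mu_s, A):
    """Mean generation 1 arrivals: messages generated minus those escaping."""
    return mu_s * A - (1.0 - math.exp(-mu_s * A))
```

With these inputs the numerical integral agrees with $(1 - e^{-\mu_s A}) / (\mu_s A)$, and multiplying by the mean $\mu_s A$ of generated messages recovers the escaping mass subtracted in `expected_generation1`.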

Values for the $\{\alpha_{1,k}\}$ are also easily derived. For an arrival message to be of rank j it
is necessary that the Erlang associated with its arrival time t + a be less than t + A. From
Bayes’ Theorem we obtain

$$\alpha_{1,j} = \Pr\{\text{A generation 1 arrival message in } [t, t + A) \text{ has rank } j \geq 2\}
  = \Pr\{a \sim \text{Erlang-}j \mid a < A\}
  = \frac{F_j(A)}{\sum_{k=2}^{\infty} F_k(A)},$$

where $F_k$ is the cumulative distribution function of an Erlang-k with rate parameter $\mu_s$.
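
The generation 1 weights can be computed directly from Erlang distribution functions, as in the Python sketch below. It is a sketch under assumptions: the names `erlang_cdf` and `alpha_generation1`, the truncation at rank 30, and the tail-sum evaluation of the Erlang CDF are illustrative choices suited to modest values of $\mu_s A$. One useful check is that the normalizing sum $\sum_{k \geq 2} F_k(A)$ equals $\mu_s A - (1 - e^{-\mu_s A})$, because $\sum_{k \geq 1} F_k(A)$ is the mean number of Poisson events in the window.

```python
import math


def erlang_cdf(k, rate, x, terms=60):
    """Pr{Erlang-k <= x}, summed as a Poisson tail to avoid cancellation."""
    lam = rate * x
    return sum(
        math.exp(-lam) * lam**n / math.factorial(n)
        for n in range(k, k + terms)
    )


def alpha_generation1(mu_s, A, max_rank=30):
    """Return ({rank: weight}, normalizing sum) for ranks 2..max_rank."""
    cdfs = {j: erlang_cdf(j, mu_s, A) for j in range(2, max_rank + 1)}
    total = sum(cdfs.values())
    return {j: c / total for j, c in cdfs.items()}, total
```

For $\mu_s = 1$ and A = 0.5 nearly all of the weight falls on rank 2, reflecting how rarely two or more service-entry events precede a message inside a short window.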

We now turn to the analysis for higher generations. Suppose $E[G_i]$ and the values $\{\alpha_{i,j}\}$
are known for generation i; condition on S + J = k and consider the distribution of $R_i(k)$.
Under our assumptions, an arrival at time $t + v \in [t, t + A)$ will cause the reprocessing of
every known arrival event and service-entry event with time-stamp between t + v and t + A.
Given S + J = k we view the placement in time of events on [t, t + A) as that of k uniforms
on [t, t + A] (Ross 1984, pg. 47). As a consequence, the number of events reprocessed by a
rollback-inducing arrival at time t + v has the distribution of a Binomial B(k, (A − v)/A),
representing the sum of k Bernoullis with success probability (A − v)/A. Coupling this fact
with the assumed distributional form of generation i messages we compute

$$\Pr\{n \text{ events reprocessed by generation } i \text{ message} \mid S + J = k\}
  = \int_0^A \sum_{j=2}^{\infty} \alpha_{i,j} f_j(v)\, \Pr\{B(k, (A - v)/A) = n\}\, dv. \qquad (1)$$
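
Equation (1) has no convenient closed form, but it is straightforward to evaluate by numerical quadrature. The Python sketch below uses a midpoint rule; the function names, the dictionary representation of the rank weights, and the step count are assumptions made for illustration, and the conditioned density is taken to be the Erlang-j density divided by $F_j(A)$.

```python
import math


def erlang_pdf(j, rate, x):
    """Density of an Erlang-j with the given rate."""
    return rate**j * x ** (j - 1) * math.exp(-rate * x) / math.factorial(j - 1)


def erlang_cdf(j, rate, x, terms=60):
    """Pr{Erlang-j <= x}, summed as a Poisson tail."""
    lam = rate * x
    return sum(math.exp(-lam) * lam**n / math.factorial(n)
               for n in range(j, j + terms))


def binomial_pmf(k, p, n):
    """Pr{B(k, p) = n}."""
    return math.comb(k, n) * p**n * (1.0 - p) ** (k - n)


def reprocess_pmf(k, alpha, mu_s, A, steps=2000):
    """Approximate Pr{n events reprocessed | S + J = k} for n = 0..k.

    alpha maps rank j to its mixture weight (hypothetical input format).
    """
    h = A / steps
    norm = {j: erlang_cdf(j, mu_s, A) for j in alpha}   # conditioning on < A
    pmf = [0.0] * (k + 1)
    for step in range(steps):
        v = (step + 0.5) * h
        density = sum(a * erlang_pdf(j, mu_s, v) / norm[j]
                      for j, a in alpha.items())
        p = (A - v) / A      # chance that a uniformly placed event follows v
        for n in range(k + 1):
            pmf[n] += density * binomial_pmf(k, p, n) * h
    return pmf
```

Because the mixture density integrates to one over the window and the binomial terms sum to one for each v, the returned list should itself sum to one up to quadrature error, which is a quick way to validate an implementation.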

Equation (1) approximates the distribution of random variable $r_i(k)$, the random number
of events reprocessed by a single generation i message, conditioned on S + J = k. We
ignore here the fact that the arrival message is itself an arrival event, and that the set of
known arrival events is continuously in flux through successive generations. Accepting this
we compute the distribution of $R_i(k)$ as the random convolution of M independent instances
of $r_i(k)$, M being Poisson with rate $E[G_i]$.
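
A random convolution of this kind is a compound Poisson distribution, and can be built up term by term. The following Python sketch shows one way to do so; the function names, the list representation of a probability mass function, and the truncation parameters are assumptions for illustration rather than the computation actually used in the study.

```python
import math


def convolve(p, q, size):
    """Convolution of two pmfs, truncated to outcomes 0..size-1."""
    out = [0.0] * size
    for i, pi in enumerate(p[:size]):
        for j, qj in enumerate(q[: size - i]):
            out[i + j] += pi * qj
    return out


def compound_poisson_pmf(single_pmf, rate, size, max_terms=60):
    """pmf of the sum of M i.i.d. draws from single_pmf, M ~ Poisson(rate)."""
    result = [0.0] * size
    m_fold = [1.0] + [0.0] * (size - 1)   # 0-fold convolution: mass at zero
    weight = math.exp(-rate)              # Pr{M = 0}
    for m in range(max_terms):
        for n in range(size):
            result[n] += weight * m_fold[n]
        m_fold = convolve(m_fold, single_pmf, size)
        weight *= rate / (m + 1)          # Pr{M = m + 1}
    return result
```

As a check, if every message reprocesses exactly one event (`single_pmf = [0.0, 1.0]`), the result reduces to a plain Poisson distribution with the same rate.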

Values $\{\alpha_{i+1,j}\}$ are computed in a similar fashion. If we condition on a generation i arrival
at time t + v and condition on there being m service-entry events in [t, t + A), then the number
of these falling between t + v and t + A is binomial. The probability that a generation i + 1
message of rank j is generated by this arrival is zero if there aren’t enough service-entry
events, i.e., if j − 1 > m. Otherwise it is the probability that the (j − 1)st service-entry event
occurs after v (equivalently, that fewer than j − 1 of the m service-entry events precede v),
and that the message it generates falls within [t, t + A). This gives

$$b_{i+1,j}(m) = \Pr\{\text{generation } i \text{ message creates a rank } j \text{ generation } i + 1 \text{ message} \mid S = m\}
  = H_j(A) \times \int_0^A \sum_{k=2}^{\infty} \alpha_{i,k} f_k(v)\, \Pr\{B(m, v/A) < j - 1\}\, dv \qquad (2)$$

where we recall that $H_j$ is the cumulative distribution function of an Erlang-j conditioned on
the sum of its first j − 1 exponential stages being less than A. $H_j(A)$ gives us the probability
that a reprocessed rank-(j − 1) service-entry event produces a message in the next generation.
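
The factor $H_j(A)$ can be evaluated from ordinary Erlang distribution functions. Since an Erlang-j can only be less than A if the sum of its first j − 1 stages is also less than A, the conditional probability reduces to the ratio $F_j(A) / F_{j-1}(A)$. The Python sketch below computes it that way; the function names are hypothetical, and the tail-sum form of the Erlang CDF is an implementation choice for modest $\mu_s A$.

```python
import math


def erlang_cdf(k, rate, x, terms=60):
    """Pr{Erlang-k <= x}, summed as a Poisson tail."""
    lam = rate * x
    return sum(math.exp(-lam) * lam**n / math.factorial(n)
               for n in range(k, k + terms))


def h_at_window_edge(j, mu_s, A):
    """Pr{Erlang-j < A | its first j - 1 stages sum to less than A}."""
    if j < 2:
        raise ValueError("message ranks start at 2")
    return erlang_cdf(j, mu_s, A) / erlang_cdf(j - 1, mu_s, A)
```

For $\mu_s = 1$ and A = 0.5, `h_at_window_edge(2, 1.0, 0.5)` is roughly 0.23: most messages from a first-ranked service-entry event in such a short window land beyond its upper edge.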

Figure 2 helps to explain these ideas. A situation with S = 5 is shown, where an arrival
message at time t + v falls ahead of the first three service-entry events. The service-entry
events ahead of the arrival have ranks 4 and 5 respectively. Arcs illustrate the send/receive
time difference between the messages sent by the reprocessed events: the rank 4 event message
falls within the window, the rank 5 event message does not. In order for there to be a rank
5 message generated, the 4th ranked service event must lie to the right of v, as must the
receive time of its message. The distribution of that receive time is an exponential added to
the distribution of the 4th service event, the latter of which is a conditional Erlang-4.

[Figure 2 diagram: time line from t to t + A with S = 5 service-entry events and an arrival at t + v]

Figure 2: Reprocessing of a rank 4 service-entry event generates a rank 5 message for the
next generation.

For each rank $j$ let $b_{i+1,j}$ be the result of unconditioning equation (2) on $S$. Then, recalling
that each reprocessed service-entry event generates two messages (assumed to have the same
time-stamp), the mean number of generation $i + 1$ messages with rank $j$ is $2 \times E[C_i] \times b_{i+1,j}$,
and the coefficients $\{a_{i+1,j}\}$ are given by


$$a_{i+1,j} = \frac{b_{i+1,j}}{\sum_{k} b_{i+1,k}}.$$

Finally,  the  mean  number  of  arrival  messages  in  the  next  generation  is  simply 


$$E[C_{i+1}] = 2\, E[C_i] \sum_{j} b_{i+1,j}.$$


Using these recursions one may, for every $S + J = k$, compute the distribution of $R_i(k)$, for
all generations $i = 1, 2, \ldots$. Conditioned on $S + J = k$, the random variables $R_0(k), R_1(k), \ldots$
may be taken to be independent (because the processes driving them are highly randomized
arrivals from elsewhere), whence we may compute the distribution of the convolution
$W(k) = \sum_{i} R_i(k)$. Finally, knowing this distribution for each $S + J = k$, we compute the distribution
of $W$ by unconditioning on $S + J$ (known to be Poisson).

Of course, any computer program calculating these distributions must truncate the
infinite sums. Taking $\mu_s = 1$, we have found that summing over the first twelve generations
yields accurate numbers when $A \in (0, 2\mu_s)$.

The distribution of $W$ describes the workload of a single LP, in terms of the numbers of
events processed. With large numbers of LPs and the randomizing message routing, we may
treat the LP workloads as being independent random variables. Under this assumption it is
straightforward to express the expected maximum workload among $N$ LPs. Letting $M_N(A)$
be the maximum workload, we know that for every non-negative integer $w$

$$\Pr\{M_N(A) \le w\} = \Pr\{W \le w\}^N$$

so that

$$E[M_N(A)] = \sum_{w=0}^{\infty} \Pr\{M_N(A) > w\} = \sum_{w=0}^{\infty} \left(1.0 - \Pr\{W \le w\}^N\right).$$
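The tail-sum formula for the expected maximum is easy to evaluate once the distribution of $W$ has been truncated to finitely many values. The Python sketch below is only an illustration under that assumption; the name `expected_max` and the list representation `pmf[w] = Pr{W = w}` are not from the paper.

```python
def expected_max(pmf, n):
    """Expected maximum of n i.i.d. non-negative integer random variables.

    pmf[w] is Pr{W = w} for w = 0 .. len(pmf)-1 and is assumed to sum to
    (approximately) 1. Uses E[M] = sum over w >= 0 of (1 - Pr{W <= w}**n).
    """
    total = 0.0
    cdf = 0.0
    for p in pmf:
        cdf = min(cdf + p, 1.0)     # guard against rounding slightly above 1
        total += 1.0 - cdf ** n
    return total
```

For example, with two fair coin flips (`pmf = [0.5, 0.5]`, `n = 2`) the maximum is 1 unless both flips are 0, so the function returns 0.75.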

Numerical problems may arise computing $y^x$ when $y$ is small and $x$ is large; a good
approximation for $E[M_N(A)]$ is the so-called characteristic maximum, used for instance in
Eick et al. (1993). Given $N$, the characteristic maximum of $W$ is the smallest value $w_c$ such
that $\Pr\{W > w_c\} < 1/N$. Since $W$ is discrete, we further refine the estimate with linear
interpolation of $W$'s cumulative distribution function between $w_c$ and $w_c - 1$, in essence
creating a continuous version $W'$ and solving for $w_c$ such that $\Pr\{W' > w_c\} = 1/N$. This $w_c$
estimates $E[M_N(A)]$.
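The interpolated characteristic maximum can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the authors' code: the name `characteristic_max`, the list representation `pmf[w] = Pr{W = w}`, the convention that the CDF is 0 at $w = -1$, and the clamp at zero are all choices made for this example.

```python
def characteristic_max(pmf, n):
    """Interpolated characteristic maximum of a discrete distribution.

    Finds the smallest integer w_c with Pr{W > w_c} < 1/n, then linearly
    interpolates the CDF between w_c - 1 and w_c to solve CDF(x) = 1 - 1/n.
    """
    target = 1.0 - 1.0 / n
    cdf = 0.0
    for w, p in enumerate(pmf):
        prev_cdf, cdf = cdf, cdf + p
        if 1.0 - cdf < 1.0 / n:
            if cdf == prev_cdf:          # degenerate step; nothing to interpolate
                return float(w)
            frac = (target - prev_cdf) / (cdf - prev_cdf)
            return max(0.0, (w - 1) + frac)   # clamp: an assumption of this sketch
    raise ValueError("pmf is truncated too early for this n")
```

Because the interpolation solves for the point where the continuous CDF equals $1 - 1/N$, the result does not depend on whether the defining inequality is taken as strict or non-strict.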

The utility of our model is illustrated by Figure 3, where we compare model predictions of
$E[M_{64}(A)]$ and $E[M_{1024}(A)]$ with measurements (from simulation of our model) for varying
values of $A$. Each measurement point is estimated from one hundred window replications.
As our purpose is only to ensure that the model captures general trends we omit confidence
intervals. We see that the model predicts performance tolerably well over a range where the
predictions span a factor of ten between smallest and largest, although there is a breakdown
at the larger end.

It  is  also  instructive  to  consider  how  the  fraction  of  committed  events  (those  events  that 
are  not  later  reprocessed)  behaves  as  a  function  of  A.  This  is  illustrated  in  Figure  4,  where  we 
plot  the  ratio  of  the  expected  maximum  committed  workload  on  a  processor  to  the  expected 
maximum  total  workload,  for  64  and  1024  LPs.  For  both  curves  shown,  the  fraction  of  useful 
work  decreases  linearly  in  A  after  a  certain  point.  This  suggests  that  under  the  assumptions 
of  our  model,  it  does  not  make  sense  to  increase  A  indefinitely.  This  is  explained  in  the 
section  to  follow. 


4  Comparison  with  YAWNS 

It is instructive to consider how $E[M_N(A)]$ behaves as a function of $A$. $E[M_N(A)]$ is basically
the product of three terms: (i) the number of message generations required until all LPs have
finished the window, (ii) the average number of rollbacks per generation, and (iii) the average
number of messages reprocessed per rollback. Our simulations have suggested that the
number of generations grows linearly in $A$, an observation that agrees with the analysis of
Eick et al. (1993). The number of messages reprocessed each rollback also increases linearly
in $A$, for the simple reason that increasing the window size introduces new events at the top
of the window to be rolled back along with the ones which were rolled back with smaller
windows. The average number of rollbacks per generation is also linear in $A$, because each
arrival message is assumed to cause the re-evaluation of all later messages. $E[M_N(A)]$ is
therefore at least a cubic function of $A$, so that the cost per simulation time unit $E[M_N(A)]/A$ (whose
units are execution time per simulation time unit) is at least quadratic in $A$. This suggests
that there may be some $A^*$ minimizing this cost. Figure 5 confirms this intuition. In fact, it
is interesting to note that $A^*$ appears to be slightly less than $\mu_s = 1$. This too is in agreement
with the model of Eick et al., even though the models and costs are different. We conclude
that $A^* = \mu_s$ is an excellent choice, and in the remainder presume this equality.

Figure 3: Comparison of observed and predicted mean maximum events processed in a
window by any LP.

It turns out that the behavior of $E[M_N(\mu_s)]/\mu_s$ in $N$ is an almost perfectly linear function
of $\log N$ in the range considered, with $E[M_N(\mu_s)]/\mu_s \approx \log_2 N + 2.9$. To incorporate the effects
of state-saving, we'll assume that the per-event cost of state-saving is a factor of $\alpha$, so that
the cost of executing $n$ events with attendant state-saving is $\alpha n$. Note that this model does
not presume that state is saved each event; it only presumes that the aggregate state-saving
overhead amortized over events is $\alpha$.

Figure 4: Fraction of committed events as a function of $A$, for 64 and 1024 LPs.

$E[M_N(A)]$ does not incorporate the cost of synchronization. To include these costs we
must consider how synchronization is performed in a computation of this type. A software
solution described by Nicol (1993a) has every LP engaging in synchronization activity once it
finds itself apparently at the synchronization point. We could assume some synchronization
cost for each and every straggler message; however, this seems excessive. Instead we'll assume
that the number of synchronizations is that which one would incur by synchronizing at the end of
each generation; empirical evidence (Nicol, 1993a) suggests that each such synchronization
costs roughly twice that of a conventional synchronization. Our simulation studies show that
a window of width $A = \mu_s$ requires 2.5 generations on average, a figure that is relatively
insensitive to the number of LPs. Taking $B$ as the execution cost of a conventional barrier
synchronization, synchronization then costs about $2.5 \times 2B = 5B$ per window, and the overall
execution cost per unit simulation time given $N$ LPs is

$$C_{optim}(N) \approx \alpha(\log_2 N + 2.9) + 5B \qquad (3)$$

Note that our assumed synchronization cost structure does not affect the optimality of
$A^* = \mu_s$, since synchronization costs then grow linearly in $A$. Also note that $B$ shows
no dependence on $N$. Asymptotically it must grow with $\log_2 N$; however, we presume that
the cost of executing an event is large enough to overshadow this dependence before $N$
becomes extremely large.

Figure 5: $E[M_N(A)]/A$ as a function of $A$, for 64 and 1024 LPs.

Now consider YAWNS. Nicol (1993) established that the average width of the conservative
window is at least $\mu_s\sqrt{\pi/(2N)} \approx 1.25\,\mu_s/\sqrt{N}$. In windows this small, the average maximum
number of events processed by any LP is no larger than 2; for large $N$ it is much closer to 1.
Including the barrier synchronization, YAWNS' cost per unit simulation time is no greater
than

$$C_{yawns}(N) = \frac{(2 + B)\sqrt{N}}{1.25\,\mu_s} \qquad (4)$$

One consequence of $A^* = \mu_s$ is that for large $N$ there is relatively little advantage to avoiding
state-saving within the YAWNS conservative window, because the optimistic window is so
much larger. For instance, if $N = 100$, then only about 12% of a window avoids state-saving.
It costs very little to compute the conservative window, and so if convenient it ought to be
done. However, the performance benefits from doing so are not large.

We may use equations (3) and (4) to compare the approaches, given values for overhead
costs. At a higher level we observe that YOW has an $O(\log_2 N)$ cost while YAWNS has an
$O(\sqrt{N})$ cost. For sufficiently large $N$, the optimistic approach will always achieve a lower
cost. How large must that $N$ be? We depict this graphically in Figure 6, plotting the solution
(for $\alpha$) of the equation $C_{optim} - C_{yawns} = 0$, as a function of $\log_2 N$ and for various values of $B$.
Solutions $\alpha = \alpha(N, B) < 1$ are plotted as 1, since state-saving can never reduce the cost
of executing an event. For any given value of $\alpha^*$ and known value $B$, one can determine
the $N^*$ for which $\alpha^* = \alpha(N^*, B)$, and determine that YOW is better than YAWNS for all
$N > N^*$. Imagine that state-saving doubles the cost of executing an event. Plotting the
line $\alpha = 2$ we look for its intersection with the various synchronization cost curves; the values
of $N$ associated with the intersections define $N^*$. For instance, if $B = 0$ then YOW is better for
$N > 128$. If $B = 1.0$ however, then YOW needs only $N > 100$, and if $B = 10.0$ it needs only
$N > 40$. YAWNS is clearly impacted more strongly by increasing synchronization costs, as
it synchronizes on the order of $\sqrt{N}$ times more often than YOW.

Figure 6: Function specifying LP threshold $N^*$ after which YOW is better than YAWNS.
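The break-even comparison can be explored numerically. The Python sketch below is only a rough illustration and rests on two assumptions about the cost equations: that equation (3) reads $\alpha(\log_2 N + 2.9) + 5B$ and that equation (4) reads $(2 + B)\sqrt{N}/1.25$ with $\mu_s = 1$. The function names are hypothetical, and thresholds computed this way need not coincide exactly with values read off Figure 6.

```python
import math

def cost_optimistic(n, alpha, b):
    """Assumed form of equation (3): YOW cost per unit simulation time."""
    return alpha * (math.log2(n) + 2.9) + 5.0 * b

def cost_yawns(n, b):
    """Assumed form of equation (4) with mu_s = 1: YAWNS cost bound."""
    return (2.0 + b) * math.sqrt(n) / 1.25

def breakeven_alpha(n, b):
    """State-saving factor alpha at which the two costs are equal (floored at 1)."""
    alpha = (cost_yawns(n, b) - 5.0 * b) / (math.log2(n) + 2.9)
    return max(alpha, 1.0)

def threshold_n(alpha, b, n_max=10**6):
    """Smallest N >= 2 at which the optimistic cost falls below the YAWNS cost."""
    for n in range(2, n_max + 1):
        if cost_optimistic(n, alpha, b) < cost_yawns(n, b):
            return n
    return None   # no crossover found below n_max

if __name__ == "__main__":
    for b in (0.0, 1.0, 10.0):
        print("B =", b, "threshold N =", threshold_n(2.0, b))
```

Under these assumed formulas the threshold shrinks as $B$ grows, which matches the qualitative observation that YAWNS is hurt more by expensive synchronization.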

The assumptions under which we've analyzed YAWNS show that if simulation time
advances by exponentially distributed amounts and if only one LP is assigned to each processor,
then YAWNS has a relatively high cost. However, YAWNS performance is sensitive to both
of these assumptions. If an LP's service time is bounded below by $\gamma > 0$, then the size of a
YAWNS window is at least $\gamma$. This seemingly minor change of assumptions defeats the assured
asymptotic superiority of YOW, because it changes YAWNS' $O(\sqrt{N})$ cost to $O(1/\gamma)$. The
relative performance of YAWNS and YOW depends primarily then on $\alpha$, $B$, and $\gamma$.

Next we show that by considering the effects of aggregating LPs onto processors, YAWNS
again circumvents YOW's assured superiority, even if service times are exponentially
distributed. The reasoning is straightforward. Let $N$ denote the number of LPs, $P$ denote the
number of processors, and presume that each processor simulates $N/P$ LPs. The average
size of a YAWNS window is $y(N) = 1.25\,\mu_s/\sqrt{N}$; the number of events each LP executes
in a window is Poisson with rate $2y(N)$. Since LPs are independent, the number of events
a processor executes each window is Poisson with rate $\lambda(N) = 2.5\sqrt{N}/P$. If $M_P(\lambda)$ is the
expected maximum of $P$ Poissons with rate $\lambda$, then YAWNS' cost per unit simulation
time per co-resident LP is


$$D_{yawns}(N) = \frac{\sqrt{N}}{1.25} \times \frac{M_P(\lambda(N)) + B}{N/P}.$$


Eick et al. study the asymptotics of $M_P(r)$, showing that $M_P(r) \sim \log P / \log\log P$ for small
$r$, and $M_P(r) \sim 2r$ for $r = \Omega(\log P)$. $\lambda(N)$ increases unboundedly in $N$, implying that for
sufficiently large $N$


$$D_{yawns}(N) \approx \frac{\sqrt{N}}{1.25} \times \frac{2\lambda(N) + B}{N/P} = 4 + \frac{2B}{\lambda(N)}.$$


The  second  term  vanishes  as  N  grows,  showing  that  YAWNS’  normalized  execution  cost  per 
LP  is  asymptotically  constant. 
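The non-asymptotic cost can be evaluated directly rather than through the $M_P(r) \sim 2r$ approximation. The following Python sketch is an illustration only: it assumes the cost takes the form $(\sqrt{N}/1.25)\,(M_P(\lambda(N)) + B)/(N/P)$ with $\lambda(N) = 2.5\sqrt{N}/P$ and $\mu_s = 1$, and the function names, tolerance, and truncation rule are hypothetical choices.

```python
import math

def expected_max_poisson(p, lam, tol=1e-12, max_terms=100000):
    """Expected maximum of p independent Poisson(lam) variables.

    Uses E[M] = sum over w >= 0 of (1 - F(w)**p), where F is the Poisson CDF,
    and stops once the summand is negligible past the mean.
    Assumes lam is moderate (exp(-lam) must not underflow).
    """
    pmf = math.exp(-lam)      # Pr{X = 0}
    cdf = pmf
    total = 0.0
    w = 0
    while w < max_terms:
        term = 1.0 - min(cdf, 1.0) ** p
        total += term
        if term < tol and w > lam:
            break
        w += 1
        pmf *= lam / w        # Pr{X = w} from Pr{X = w-1}
        cdf += pmf
    return total

def d_yawns(n, p, b):
    """Assumed YAWNS cost per unit simulation time per co-resident LP (mu_s = 1)."""
    lam = 2.5 * math.sqrt(n) / p
    return (math.sqrt(n) / 1.25) * (expected_max_poisson(p, lam) + b) / (n / p)

if __name__ == "__main__":
    for lps_per_proc in (1, 16, 256, 1024):
        print(lps_per_proc, d_yawns(16 * lps_per_proc, 16, 0.0))
```

With $P = 16$ and $B = 0$, evaluating `d_yawns` for increasing $N/P$ shows the normalized cost falling as aggregation increases, in line with the qualitative argument above.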

The result above does not imply that YAWNS' normalized cost is asymptotically 4
because constants in the asymptotic analysis are missing from our expressions. However,
Figure 7 plots the predicted cost (not asymptotic) as a function of $\log(N/P)$, assuming $P = 16$
and $B = 0$. It also plots the predicted performance of YOW, again assuming $A = \mu_s$, under
the same values of $N$ and $P$. State-saving overhead factors of $\alpha = 1$, 1.2 and 1.5 are shown.
These figures are obtained by computing appropriate convolutions of $W$, and finding the
expected maximum convolved processor load. Since aggregation may change the relative
optimality for YOW of $A = \mu_s$, we also computed costs assuming other window sizes.
Differences from the presented data were small. Assuming that synchronization costs contribute
little to the overhead cost under high loads, it is clear now that YAWNS can do better than
YOW under high degrees of aggregation, or when state-saving overhead is significant.

It should also be noted that our model assumptions work against YOW in the aggregated
case. When LPs tend to communicate with other LPs on the same processor one may expect
advantages due to significantly reduced communication costs. This is especially true in our
model because the recomputation cost due to delayed stragglers is consequential. However,
the assumption that messages are routed uniformly at random means that no such locality
is present in the model. Our costing assumptions remain valid in the aggregated case so
long as event processing costs are of the same order as communication and the window size
is small.

Figure 7: YAWNS and YOW normalized cost per unit simulation time under aggregation as
a function of $\log(N/P)$.


5  Conclusions 

We have analyzed a simple model of parallel simulation, to assess the benefit of adding
optimism to an existing conservative synchronization protocol, YAWNS. Our approach is
novel to the problem area, and is relatively simple. We show how to compute approximate
probability distributions of processor workload. To these distributions we add overheads due
to state-saving and synchronization. In addition, we consider the effects on performance
due to aggregating many LPs onto a processor.

The extension, YOW, remains window-based; our analysis predicts that there is some
optimally sized window, a prediction borne out by experiments. The window is relatively large
compared to YAWNS', but is still so small that on average a logical process executes only
two events within it. Using this window size we construct equations predicting YOW's and
YAWNS' execution cost per unit simulation time, and observe that under the assumption
of one LP per processor, YOW is asymptotically better than YAWNS as the number of
LPs grows. However, when we analyze performance allowing many LPs per processor we
find that YAWNS does better than YOW under moderate levels of aggregation, or when
state-saving costs are non-negligible.

Far-reaching quantitative conclusions are questionable for a model of this type. For both
YAWNS and YOW small changes in model assumptions will significantly affect quantitative
results. Qualitatively though we may infer that if actual reprocessing costs resemble those
in our model and global synchronization costs aren't high, then it is likely that limiting
optimism is a good thing in a window-based framework. We also conclude that if probability
distributions driving simulation time advance have no lower support, then YAWNS will not
do well when the problem is sparse relative to the architecture. However, this problem
disappears for large problems where LPs are highly aggregated onto processors. Perhaps the
strongest conclusion we offer is that performance of parallel simulations is more strongly a
function of state-saving, synchronization/communication costs, problem size, and degree of
aggregation than it is of the specific synchronization protocol. Synchronization methods
ought to be chosen after the problem is known, and to take advantage of the problem's
characteristics.

An open and important question remains: whether a window-based framework offers
better performance than a completely asynchronous one. While we have not addressed this
problem, we believe that extension of our analytic approach to the Gupta et al. model
assumptions may lead to the desired comparison. We also believe a more precise treatment
of the effects of communication delay is possible, which will lead to better understanding of
the effect the underlying architecture has on synchronization behavior.